This document summarizes data observations from Round 1 of the Americas East Open Division 1, C category. The dataset includes all races except Race 3, the team time trial. All data was collected from Zwift Power.
Below is the distribution of the number of races each rider competed in. There were 126 riders. No rider competed in all 5 of 5 races.
The Cleveland plot below shows the finishing positions of each rider (the rider IDs have been anonymized). It’s interesting that almost none of the riders placed consistently in the top 10, or even the top 20. The ones that did tended to only do a couple of races. Many riders had a very wide range of results. It does not appear to be uncommon for a rider to finish in the top 5 in one race and 50th or worse in another. Of the 126 riders, only 16 placed in the top 20 in each of their races. Apologies for the long plot, but there’s a lot of data to display - there are interactive tools to zoom and filter by race in the top right.
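For anyone who wants to reproduce the "top 20 in every race" count, here is a minimal sketch of how it could be computed. The rows, rider IDs, and function names below are made up for illustration and are not the actual dataset or the code used for this report.

```python
from collections import defaultdict

# Hypothetical (rider_id, race, finishing_position) rows; NOT the real data.
results = [
    ("R001", 1, 4), ("R001", 2, 57),
    ("R002", 1, 12), ("R002", 4, 18),
    ("R003", 2, 3),
    ("R004", 1, 25), ("R004", 2, 19), ("R004", 5, 31),
]

def position_ranges(rows):
    """Map each rider to their (best, worst) finishing position."""
    by_rider = defaultdict(list)
    for rider, _race, position in rows:
        by_rider[rider].append(position)
    return {rider: (min(p), max(p)) for rider, p in by_rider.items()}

def always_within(rows, cutoff=20):
    """Riders whose worst finish is still at or inside the cutoff."""
    ranges = position_ranges(rows)
    return sorted(r for r, (_best, worst) in ranges.items() if worst <= cutoff)

ranges = position_ranges(results)
consistent = always_within(results, cutoff=20)
print(ranges)
print(consistent)
```

The (best, worst) pair per rider is also what each row of a Cleveland plot spans, so the same summary covers both the plot and the count.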
Not presented here for the sake of space, but a quick check was done on the change in weight riders reported across their races. It does not appear as though there were any suspicious ‘jumps’ in weight. I had thought that, with the 50% rule, some riders might lower their weight after the 3rd race to get a boost without triggering the 3.2 ceiling rule. This does not appear to be the case.
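One way a check like this could be done is to compare each rider's reported weight between consecutive races and flag large relative changes. This is only a sketch: the weights are invented, and the 2% threshold is an assumption for illustration, not the criterion used in this report.

```python
# Hypothetical reported weights (kg) per rider, keyed by race number; NOT real data.
weights = {
    "R001": {1: 72.0, 2: 72.0, 4: 71.8, 5: 71.8},
    "R002": {1: 80.5, 2: 80.5, 4: 77.0},
}

def weight_jumps(weights_by_rider, max_rel_change=0.02):
    """Flag consecutive-race weight changes larger than max_rel_change.

    max_rel_change is an assumed threshold (2%), chosen only for illustration.
    Returns (rider, previous_race, current_race, relative_change) tuples.
    """
    flagged = []
    for rider, by_race in weights_by_rider.items():
        races = sorted(by_race)
        for prev, curr in zip(races, races[1:]):
            before, after = by_race[prev], by_race[curr]
            rel = (after - before) / before
            if abs(rel) > max_rel_change:
                flagged.append((rider, prev, curr, round(rel, 4)))
    return flagged

flagged = weight_jumps(weights)
print(flagged)
```

A negative relative change after the 3rd race is the pattern that would have been worth a closer look.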
For each race, riders were grouped by whether or not they finished in the top 20. A random sample of 230 results was used to train a random forest model. Only “targetable” explanatory variables were used (i.e. Normalized W/Kg is not something a rider can directly target, so it was excluded). The model was then used to predict whether each of the remaining 77 results would be a top 20 finish. The model had an overall predictive accuracy of 0.844 (65 of 77 correct).
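The labelling and hold-out split described above can be sketched as follows. The rows are placeholders standing in for the 307 results (230 + 77), and the helper names and the seed are assumptions for illustration. The random forest fit itself is left out here, since the exact model settings are not part of this write-up.

```python
import random

def label_top_n(position, n=20):
    """Return 'Yes' if the finishing position is inside the top n, else 'No'."""
    return "Yes" if position <= n else "No"

def train_test_split(rows, n_train, seed=0):
    """Shuffle a copy of the rows and split it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder rows with made-up positions; NOT the real results.
rows = [{"result_id": i, "position": (i % 60) + 1} for i in range(307)]
for row in rows:
    row["top20"] = label_top_n(row["position"])

train, test = train_test_split(rows, n_train=230, seed=42)
print(len(train), len(test))
```

Splitting on results rather than riders means the same rider can appear in both the training and test sets, which is worth keeping in mind when reading the accuracy figure.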
## Confusion Matrix and Statistics
##
##           Reference
## Prediction No Yes
##        No  54   7
##        Yes  5  11
##
## Accuracy : 0.8442
## 95% CI : (0.7436, 0.9168)
## No Information Rate : 0.7662
## P-Value [Acc > NIR] : 0.06474
##
## Kappa : 0.5475
##
## Mcnemar's Test P-Value : 0.77283
##
## Sensitivity : 0.9153
## Specificity : 0.6111
## Pos Pred Value : 0.8852
## Neg Pred Value : 0.6875
## Prevalence : 0.7662
## Detection Rate : 0.7013
## Detection Prevalence : 0.7922
## Balanced Accuracy : 0.7632
##
## 'Positive' Class : No
##
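As a sanity check, the headline statistics can be recomputed directly from the four counts in the confusion matrix above. This is a small sketch with my own variable names, treating ‘No’ as the positive class as the output does.

```python
# Counts from the confusion matrix above (rows = prediction, columns = reference).
# 'No' is the positive class.
tp = 54  # predicted No,  reference No
fp = 7   # predicted No,  reference Yes
fn = 5   # predicted Yes, reference No
tn = 11  # predicted Yes, reference Yes

n = tp + fp + fn + tn
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

# No-information rate: accuracy from always guessing the larger reference class.
no_information_rate = max(tp + fn, fp + tn) / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("accuracy", accuracy),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("balanced_accuracy", balanced_accuracy),
    ("no_information_rate", no_information_rate),
    ("kappa", kappa),
]:
    print(f"{name}: {value:.4f}")
```

Note that the accuracy (0.8442) is only moderately above the no-information rate (0.7662), and the reported p-value for that comparison (0.06474) is above 0.05, so the model is useful but not dramatically better than always guessing “not top 20”.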
The same modelling approach was used to identify Top 10, Top 30, and Top 40 finishes. The model’s predictive power declined from Top 10 to Top 40, which suggests that the power profiles of riders finishing 15th versus 25th differ more than those of riders finishing 35th versus 50th. The Top 20 group was chosen as a ‘good balance’. The goal isn’t necessarily to answer the question “How do I get in the top 20?”; rather, it’s “What metrics can I target in my training that will give me the best returns?”. The top 10 variables are shown below:
We can see that short-term W/Kg carries substantially more importance than the other variables. Taking a deeper look into the top 4 variables:
It’s interesting that the effect of these variables is very non-linear. For 30s W/Kg, for example, it only improves your odds of getting in the top 20 once you have a value of more than 5.